Abstract: Reduced k-means clustering is a method for clustering objects in a
low-dimensional subspace. The advantage of this method is that both the
clustering of objects and a low-dimensional subspace reflecting the cluster
structure are obtained simultaneously. In this paper, the relationship between
conventional k-means clustering and reduced k-means clustering is discussed.
Conditions ensuring almost sure convergence of the estimator of reduced k-means
clustering as the sample size increases unboundedly are presented. The results
are provided for a more general model that covers both conventional k-means
clustering and reduced k-means clustering. Moreover, a new criterion and its
consistent estimator are proposed to determine the optimal dimension number of
a subspace, given the number of clusters.

1. Introduction

The aim of cluster analysis is the discovery of a finite number of homogeneous 
classes from data. In some cases, a cluster structure is considered to lie in a 
low-dimensional subspace of data, and the following procedure is applied: 

Step 1. Principal component analysis (PCA) is performed, and the first few
        components are obtained.
Step 2. Conventional k-means clustering is performed on the principal scores
        of the first few principal components.
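The two steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper names `pca_scores`, `kmeans`, and `tandem_clustering` are hypothetical, and the k-means step is a plain Lloyd iteration with a single random start, which is an assumption rather than the procedure used in the cited works.

```python
import numpy as np

def pca_scores(X, q):
    """Step 1: scores on the first q principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each variable
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:q].T                         # n x q score matrix

def kmeans(Y, k, n_iter=100, seed=0):
    """Step 2: plain Lloyd iterations (hypothetical helper, one random start)."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):              # keep the old center if a cluster is empty
                new_centers[j] = Y[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def tandem_clustering(X, k, q, seed=0):
    """Tandem clustering: k-means on the first q principal scores."""
    scores = pca_scores(X, q)
    labels, centers = kmeans(scores, k, seed=seed)
    return labels, centers, scores
```

Because the subspace in Step 1 is chosen without reference to the clusters, nothing in this sketch forces the retained components to carry the cluster structure.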

This two-step procedure is called "tandem clustering" by Arabie & Hubert
(1994) and has been discouraged by several authors (e.g., Arabie & Hubert,
1994; Chang, 1983; De Soete & Carroll, 1994). Because the first few principal
components of PCA do not necessarily reflect the cluster structure in data,
the appropriate clustering result may not be obtained by using the tandem
clustering approach. Figure 1 shows that the first two principal components
do not reflect the cluster structure, and the clustering result of the tandem
clustering is incorrect.

Fig 1. First two dimensions of the principal component analysis and result of
the tandem clustering (black points represent the misclassified objects).

De Soete & Carroll (1994) proposed reduced k-means (RKM) clustering. RKM
clustering simultaneously determines the clusters of objects on the basis of
the k-means criterion and the subspace that is informative about the cluster
structure in data on the basis of component analysis.
In other words, for given data points $x_1, \dots, x_n$ in $\mathbb{R}^p$, a
fixed cluster number $k$, and a dimension number of the subspace $q$
($q < \min\{k - 1, p\}$), RKM clustering is defined by the minimization
problem of the following loss function:

$$\mathrm{RKM}_n(A, F \mid k, q) := \sum_{i=1}^{n} \min_{1 \le j \le k} \|x_i - A f_j\|^2, \qquad (1)$$

where $f_j \in \mathbb{R}^q$ and $A$ is a $p \times q$ column-wise orthonormal
matrix. For some clustering methods related to k-means clustering, several
authors have discussed their statistical properties (e.g., Abraham et al.,
2003; Garcia-Escudero et al., 1999; Pollard, 1981; Pollard, 1982; von Luxburg
et al., 2008). However, because RKM clustering was proposed in the framework
of descriptive statistics, its statistical properties have not been discussed.

Y. Terada / Consistency of RKM Clustering

When data points are independently drawn from a population distribution $P$,
the objective function is rewritten as

$$\mathrm{RKM}(F, A, P_n) := \int \min_{f \in F} \|x - A f\|^2 \, P_n(dx),$$

where $F$ is a set containing $k$ or fewer points in $\mathbb{R}^q$, and $P_n$
is the empirical measure obtained from the data. For each fixed $F$ and $A$,
the strong law of large numbers (SLLN) shows that

$$\lim_{n \to \infty} \mathrm{RKM}(F, A, P_n) = \mathrm{RKM}(F, A, P) := \int \min_{f \in F} \|x - A f\|^2 \, P(dx) \quad \text{a.s.}$$



Thus, we wish to ensure that the global minimizer of
$\mathrm{RKM}(\cdot, \cdot, P_n)$ converges almost surely to the global
minimizers of $\mathrm{RKM}(\cdot, \cdot, P)$, say the population global
minimizers.

In this paper, the strong consistency of RKM under i.i.d. sampling is proven.
For this purpose, the framework of the proof of the strong consistency of the
k-means clustering approach proposed by Pollard (1981) is used; in this
framework, the existence and uniqueness of the population global minimizers
are assumed for consistency, and conditions for the existence of the global
minimizers are not discussed. For RKM clustering, the uniqueness of the
population global minimizers cannot be assumed because RKM clustering has
rotational indeterminacy. Therefore, a sufficient condition for the existence
of the population global minimizers must be derived; it is also necessary to
establish that the distance between the sample estimator and the set of global
minimizers converges almost surely to zero as the sample size approaches
infinity.

This paper is organized as follows. In Section 2, the original algorithm of
RKM clustering and the visualization of the result are described. Then, the
relationship between the conventional k-means clustering method and RKM
clustering is presented. The notation and some properties of RKM, including
the rotational indeterminacy, are introduced in Section 3. The uniform SLLN
and the continuity of the objective function of RKM clustering are presented
in Section 4. In Section 5, conditions for the existence of the population
global minimizers are determined, and a theorem regarding the strong
consistency of RKM clustering is stated. In Section 6, the main proof of the
consistency theorem is explained. In Section 7, a new criterion and its
consistent estimator are proposed to determine the optimal dimension number of
a subspace, given the number of clusters. Moreover, the effectiveness of the
criterion is illustrated through numerical experiments.

2. Reduced k-means clustering

2.1. Algorithm and visualization of reduced k-means clustering 

Let $X = (x_{ij})_{n \times p}$ be a data matrix and $x_i$ ($i = 1, \dots, n$)
be the row vectors of $X$, where $n$ is the number of objects and $p$ is the
number of variables. The number of clusters and the number of components to
which the variables are reduced are denoted by $k$ and $q$, respectively. RKM
clustering is defined as the minimization problem of the following criterion:

$$\mathrm{RKM}_n(A, F, U \mid k, q) := \|X - U F A^T\|_F^2 = \sum_{i=1}^{n} \min_{1 \le j \le k} \|x_i - A f_j\|^2, \qquad (2)$$

where $\|\cdot\|$ and $\|\cdot\|_F$ denote the usual Euclidean norm and the
Frobenius norm, respectively, $U = (u_{ij})_{n \times k}$ is a binary
membership matrix that specifies the cluster membership of each object,
$A = (a_{ij})_{p \times q}$ is a column-wise orthonormal loading matrix,
$F = (f_{ij})_{k \times q}$ is a centroid matrix, and $f_j$ is the centroid of
the $j$th cluster for each $j = 1, \dots, k$. For example, this problem can be
solved by the following alternating least squares algorithm:

Step 0. First, initial values are chosen for $A$, $F$, and $U$.

Step 1. Let $Q \Sigma P^T$ be the singular value decomposition of
        $(UF)^T X$, where $Q$ is a $q \times q$ orthonormal matrix, $\Sigma$
        is a $q \times q$ diagonal matrix, and $P$ is a $p \times q$
        column-wise orthonormal matrix. $A$ is updated by $P Q^T$.

Step 2. For each $i = 1, \dots, n$ and each $j = 1, \dots, k$, we update
        $u_{ij}$ by

$$u_{ij} = \begin{cases} 1 & \text{if } \|A^T x_i - f_j\|^2 < \|A^T x_i - f_{j'}\|^2 \text{ for each } j' \ne j, \\ 0 & \text{otherwise.} \end{cases}$$

Step 3. $F$ is updated using $(U^T U)^{-1} U^T X A$.

Step 4. Finally, the value of the function $\mathrm{RKM}_n$ for the present
        values of $A$, $F$, and $U$ is computed. When the present values have
        decreased the function value, $A$, $F$, and $U$ are updated again in
        accordance with Steps 1-3. Otherwise, the algorithm has converged.
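A minimal NumPy sketch of Steps 0-4 is given below. The function names `rkm_loss` and `rkm_als`, the random initialization, the stopping tolerance `tol`, and the handling of empty clusters (their centroids are simply left unchanged) are assumptions made for illustration; they are not prescribed by the algorithm as stated.

```python
import numpy as np

def rkm_loss(X, A, F, U):
    """Criterion (2): squared Frobenius norm of X - U F A^T."""
    return float(np.linalg.norm(X - U @ F @ A.T, "fro") ** 2)

def rkm_als(X, k, q, n_iter=200, tol=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Step 0: random orthonormal A, balanced random memberships, centroids of XA.
    A, _ = np.linalg.qr(rng.standard_normal((p, q)))
    labels = rng.permutation(np.arange(n) % k)
    U = np.eye(k)[labels]
    F = (U.T @ (X @ A)) / U.sum(axis=0)[:, None]
    history = [rkm_loss(X, A, F, U)]
    for _ in range(n_iter):
        # Step 1: SVD of (UF)^T X = Q Sigma P^T, then A = P Q^T.
        Q, _, Pt = np.linalg.svd((U @ F).T @ X, full_matrices=False)
        A = Pt.T @ Q.T
        # Step 2: assign each projected object A^T x_i to its nearest centroid.
        Y = X @ A
        d2 = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        U = np.eye(k)[labels]
        # Step 3: F = (U^T U)^{-1} U^T X A, skipping empty clusters (assumption).
        counts = U.sum(axis=0)
        keep = counts > 0
        F = F.copy()
        F[keep] = (U.T @ Y)[keep] / counts[keep, None]
        # Step 4: stop once the criterion no longer decreases.
        history.append(rkm_loss(X, A, F, U))
        if history[-2] - history[-1] <= tol:
            break
    return A, F, labels, history
```

Each of Steps 1-3 minimizes the criterion over one block of parameters with the others held fixed, so the recorded values in `history` should be non-increasing; in practice the routine would be restarted from several seeds and the best run kept.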

Other formulations and algorithms for RKM clustering have been presented
by De Soete & Carroll (1994) and Timmerman et al. (2010).

The algorithms for RKM clustering monotonically decrease the function
$\mathrm{RKM}_n$. As shown below, because $\mathrm{RKM}_n$ is bounded below,
the solution converges to a local minimum point as the iterations proceed.
Because of the binary constraint on $U$, the solutions of these algorithms are
often only local minima. To reduce this risk, many random starts should be
used.

The objective function $\mathrm{RKM}_n$ can be decomposed into two terms:

$$\mathrm{RKM}_n(A, F, U \mid k, q) = \|X - X A A^T\|_F^2 + \|X A - U F\|_F^2. \qquad (3)$$

The first term of equation (3) is the objective function of PCA, and the
second term is the k-means criterion in a low-dimensional subspace. Thus,
for optimal solutions $A$, $F$, and $U$, we have $F = (U^T U)^{-1} U^T X A$.
Using the optimal solutions $A$, $F$, and $U$, the low-dimensional
representation of the objects and cluster centers can be obtained:

$$Y := X A \quad \text{and} \quad G := (U^T U)^{-1} U^T Y. \qquad (4)$$

Using $Y$ and $A$, a biplot reflecting the cluster structure can be presented.
Figure 2 shows the biplot of the RKM clustering for the same data as that
used in Figure 1.
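The decomposition (3) holds for any column-wise orthonormal $A$, not only at the optimum, because $(X - XAA^T)A = 0$ makes the cross term vanish. The following sketch checks this numerically on random inputs; the helper name `rkm_terms` and the random test data are illustrative assumptions.

```python
import numpy as np

def rkm_terms(X, A, F, U):
    """Return the RKM criterion and the two terms on the right side of (3)."""
    total = np.linalg.norm(X - U @ F @ A.T, "fro") ** 2
    pca_term = np.linalg.norm(X - X @ A @ A.T, "fro") ** 2
    kmeans_term = np.linalg.norm(X @ A - U @ F, "fro") ** 2
    return total, pca_term, kmeans_term

rng = np.random.default_rng(0)
n, p, k, q = 50, 6, 4, 2
X = rng.standard_normal((n, p))
A, _ = np.linalg.qr(rng.standard_normal((p, q)))   # orthonormal columns
F = rng.standard_normal((k, q))                     # arbitrary centroids
U = np.eye(k)[rng.integers(k, size=n)]              # arbitrary memberships
total, pca_term, kmeans_term = rkm_terms(X, A, F, U)
```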

2.2. The relationship between the conventional k-means and the 
RKM clusterings 

The objective function of the conventional k-means clustering method is given
by

$$\mathrm{KM}_n(C, U \mid k) := \|X - U C\|_F^2, \qquad (5)$$

where $C$ is a $k \times p$ cluster center matrix. Let $P \Sigma Q^T$ be the
singular value decomposition of $C$, where $P$ is a $k \times k$ orthonormal
matrix, $\Sigma$ is a $k \times k$ diagonal matrix, and $Q$ is a $p \times k$
column-wise orthonormal matrix. Function (5) can be expressed as

$$\|X - U C\|_F^2 = \|X - U P \Sigma Q^T\|_F^2.$$

Considering $P \Sigma$ and $Q$ as a low-dimensional centroid matrix $F$ and a
loading matrix $A$, respectively, function (5) is equivalent to the objective
function of RKM, $\mathrm{RKM}_n(A, F, U \mid k, k)$. Thus, RKM clustering
includes the conventional k-means clustering analysis as a special case.
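As a small illustration of this reparametrization, the sketch below factors a center matrix $C$ into $F = P\Sigma$ and $A = Q$ with NumPy's thin SVD. The helper name `centers_to_rkm` is hypothetical, and the sketch assumes $k \le p$ so that $Q$ has orthonormal columns.

```python
import numpy as np

def centers_to_rkm(C):
    """Factor a k x p center matrix C (k <= p) as C = F A^T with A^T A = I."""
    P, s, Qt = np.linalg.svd(C, full_matrices=False)  # C = P diag(s) Qt
    F = P * s          # k x k, equals P @ diag(s)
    A = Qt.T           # p x k, column-wise orthonormal
    return F, A

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 5))        # k = 3 centers in p = 5 dimensions
F, A = centers_to_rkm(C)
```

Since $F A^T$ reproduces $C$ exactly, the two criteria take the same value for every membership matrix $U$.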
Fig 2. Biplot of the result of RKM clustering for the same data set as Figure 1
(black points represent the misclassified objects).

3. Preliminaries 

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X_1, \dots, X_n$ be
independent random variables with a common population distribution $P$ on
$\mathbb{R}^p$; let $P_n$ be the empirical measure based on $X_1, \dots, X_n$.
For typographical convenience, the set of all $p \times q$ column-wise
orthonormal matrices is denoted by $\mathcal{O}(p \times q)$, and
$\mathcal{R}_k := \{R \subset \mathbb{R}^q \mid \#(R) \le k\}$, where $\#(R)$
is the cardinality of $R$. Thus, the parameter space is denoted by
$\Xi_k := \mathcal{R}_k \times \mathcal{O}(p \times q)$. $B_q(r)$ denotes the
$q$-dimensional closed ball of radius $r$ centered at the origin. For each
$M > 0$, define $\mathcal{R}_k^*(M) := \{R \subset B_q(M) \mid \#(R) \le k\}$
and $\Theta_k^*(M) := \mathcal{R}_k^*(M) \times \mathcal{O}(p \times q)$. Let
$\phi : \mathbb{R} \to \mathbb{R}$ be a non-negative non-decreasing function
and $Q$ be a probability measure on $\mathbb{R}^p$. For each finite subset
$F \subset \mathbb{R}^q$ and each $A \in \mathcal{O}(p \times q)$, the loss
function of RKM with $Q$ is defined by

$$\Phi(F, A, Q) := \int \min_{f \in F} \phi(\|x - A f\|) \, Q(dx).$$

Write

$$m_k(Q) := \inf_{(F, A) \in \Xi_k} \Phi(F, A, Q) \quad \text{and} \quad m_k^*(Q \mid M) := \inf_{(F, A) \in \Theta_k^*(M)} \Phi(F, A, Q).$$
For $\theta = (F, A) \in \Xi_k$, both descriptions $\Phi(\theta, Q)$ and
$\Phi(F, A, Q)$ are used. In addition,
$\Theta' := \{\theta \in \Xi_k \mid m_k(P) = \Phi(\theta, P)\}$ and
$\Theta'_n := \{\theta \in \Xi_k \mid m_k(P_n) = \Phi(\theta, P_n)\}$. For each
$M > 0$,
$\Theta^* := \{\theta \in \Theta_k^*(M) \mid m_k^*(P \mid M) = \Phi(\theta, P)\}$
and
$\Theta_n^* := \{\theta \in \Theta_k^*(M) \mid m_k^*(P_n \mid M) = \Phi(\theta, P_n)\}$.
The notations $\Theta'(k)$ and $\Theta'_n(k)$ are used to emphasize that
$\Theta'$ and $\Theta'_n$ depend on the index $k$. One of the measurable
estimators in $\Theta'_n$ will be denoted by $\hat{\theta}_n$ or
$\hat{\theta}_n(k)$. Similarly, we will also denote one of the measurable
estimators in $\Theta_n^*$ by $\hat{\theta}_n^*$ or $\hat{\theta}_n^*(k)$. For
the existence of measurable estimators, see Section 6.7 of Pfanzagl (1996).

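Let $d_F(\cdot, \cdot)$ be the distance between two matrices based on the
Frobenius norm and $d_H(\cdot, \cdot)$ the Hausdorff distance, which is defined
for finite subsets $A, B \subset \mathbb{R}^q$ as

$$d_H(A, B) := \max\Bigl\{ \max_{a \in A} \min_{b \in B} \|a - b\|, \; \max_{b \in B} \min_{a \in A} \|a - b\| \Bigr\}.$$

Moreover, let $d$ be the product distance with $d_F$ and $d_H$. In this paper,
the distance between $\hat{\theta}_n$ and $\Theta'$ is defined as

$$d(\hat{\theta}_n, \Theta') := \inf\{ d(\hat{\theta}_n, \theta) \mid \theta \in \Theta' \}.$$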

To clarify the minimization procedures, the function $\phi$ must satisfy some
regularity conditions. As proposed by Pollard (1981), it is assumed that
$\phi$ is continuous and $\phi(0) = 0$. Moreover, to control the growth of
$\phi$, it is assumed that

$$\exists \lambda > 0; \; \forall r > 0; \; \phi(2r) \le \lambda \phi(r).$$

For each $f \in \mathbb{R}^q$ and each $A \in \mathcal{O}(p \times q)$,

$$\begin{aligned}
\int \phi(\|x - A f\|) \, P(dx)
  &\le \int \phi(\|x\| + \|A f\|) \, P(dx) = \int \phi(\|x\| + \|f\|) \, P(dx) \\
  &\le \int_{\|f\| \ge \|x\|} \phi(2\|f\|) \, P(dx) + \int_{\|f\| < \|x\|} \phi(2\|x\|) \, P(dx) \\
  &\le \phi(2\|f\|) + \lambda \int \phi(\|x\|) \, P(dx).
\end{aligned}$$

Therefore, as long as $\int \phi(\|x\|) \, P(dx)$ is finite, $\Phi(F, A, P)$ is
also finite for each $F$ and each $A \in \mathcal{O}(p \times q)$.

Let $R$ be a $q \times q$ orthonormal matrix, i.e., $R^T R = R R^T = I_q$. For
each $f \in \mathbb{R}^q$ and each $A \in \mathcal{O}(p \times q)$,

$$\int \phi(\|x - A f\|) \, P(dx) = \int \phi(\|x - A R^T R f\|) \, P(dx).$$

It follows that $\Theta'$ is not a singleton when $\Theta' \ne \emptyset$, thus
suggesting that RKM clustering has rotational indeterminacy.
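The rotational indeterminacy can be checked numerically on the empirical loss $\Phi(F, A, P_n)$. The sketch below is illustrative only: the function name `empirical_loss`, the choice $\phi(r) = r^2$, and the random test data are assumptions.

```python
import numpy as np

def empirical_loss(X, F, A, phi=np.square):
    """Phi(F, A, P_n): mean over x_i of min_f phi(||x_i - A f||)."""
    centers = F @ A.T                                   # rows are (A f_j)^T
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float(phi(dist).min(axis=1).mean())

rng = np.random.default_rng(0)
n, p, q, k = 100, 5, 2, 3
X = rng.standard_normal((n, p))
A, _ = np.linalg.qr(rng.standard_normal((p, q)))        # column-wise orthonormal
F = rng.standard_normal((k, q))                          # rows are centers f_j
R, _ = np.linalg.qr(rng.standard_normal((q, q)))        # q x q orthonormal
loss_original = empirical_loss(X, F, A)
loss_rotated = empirical_loss(X, F @ R.T, A @ R.T)      # (f, A) -> (R f, A R^T)
```

The two values agree because $A R^T R f = A f$ for every center, so the pair $(\{R f\}, A R^T)$ attains the same loss as $(F, A)$.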

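4. The uniform SLLN and the continuity of $\Phi(\cdot, \cdot, P)$

Proposition 1. Let $M > 0$ be an arbitrary number. Let $\mathcal{G}$ denote the
class of all $P$-integrable functions on $\mathbb{R}^p$ of the form

$$g_{(F, A)}(x) := \min_{f \in F} \phi(\|x - A f\|),$$

where $(F, A)$ takes all values over $\Theta_k^*(M)$. Suppose that
$\int \phi(\|x\|) \, P(dx) < \infty$. Then,

$$\lim_{n \to \infty} \sup_{g \in \mathcal{G}} \left| \int g(x) \, P_n(dx) - \int g(x) \, P(dx) \right| = 0 \quad \text{a.s.} \qquad (6)$$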



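Proof. DeHardt (1971) provided the following sufficient condition for the
uniform SLLN (6): for all $\epsilon > 0$, there exists a finite class of
functions $\mathcal{G}_\epsilon$ such that for each $g \in \mathcal{G}$, there
exist $\underline{g}$ and $\overline{g}$ in $\mathcal{G}_\epsilon$ with
$\underline{g} \le g \le \overline{g}$ and
$\int \overline{g}(x) \, P(dx) - \int \underline{g}(x) \, P(dx) < \epsilon$.

An arbitrary $\epsilon > 0$ is selected, and $S_{p \times q}(\sqrt{q})$ denotes
the surface of the sphere in $\mathbb{R}^{p \times q}$ of radius $\sqrt{q}$
centered at the origin. To find such a finite class $\mathcal{G}_\epsilon$,
$D_{\delta_1}$ is defined as a finite subset of $\mathbb{R}^q$ satisfying

$$\forall f \in B_q(M); \; \exists g \in D_{\delta_1}; \; \|f - g\| < \delta_1$$

and $\mathcal{A}_{p \times q, \delta_2}$ as a finite subset of
$S_{p \times q}(\sqrt{q})$ satisfying

$$\forall A \in S_{p \times q}(\sqrt{q}); \; \exists B \in \mathcal{A}_{p \times q, \delta_2}; \; \|A - B\|_F < \delta_2.$$

Define $\mathcal{R}_{k, \delta_1} := \{F \in \mathcal{R}_k^*(M) \mid F \subset D_{\delta_1}\}$.
Take $\mathcal{G}_\epsilon$ as the finite class of functions of the form

$$\min_{f \in F'} \phi(\|x - A' f\| + \sqrt{q}\,\delta_1 + M \delta_2) \quad \text{or} \quad \min_{f \in F'} \phi(\|x - A' f\| - \sqrt{q}\,\delta_1 - M \delta_2),$$

where $(F', A')$ takes all values over
$\mathcal{R}_{k, \delta_1} \times \mathcal{A}_{p \times q, \delta_2}$ and
$\phi(r)$ is defined as zero for all negative $r < 0$.

For given $F = \{f_1, \dots, f_k\} \in \mathcal{R}_k^*(M)$ and
$A \in \mathcal{O}(p \times q)$, there exist
$F' = \{f'_1, \dots, f'_k\} \in \mathcal{R}_{k, \delta_1}$ with
$\|f_i - f'_i\| < \delta_1$ for each $i$ and
$A' \in \mathcal{A}_{p \times q, \delta_2}$ with $\|A - A'\|_F < \delta_2$.
Corresponding to each $g_{(F, A)} \in \mathcal{G}$, choose

$$\overline{g}_{(F, A)} := \min_{f \in F'} \phi(\|x - A' f\| + \sqrt{q}\,\delta_1 + M \delta_2)$$

and

$$\underline{g}_{(F, A)} := \min_{f \in F'} \phi(\|x - A' f\| - \sqrt{q}\,\delta_1 - M \delta_2).$$

Because $\phi$ is a monotone function and

$$\|x - A' f'_i\| - \sqrt{q}\,\delta_1 - M \delta_2 \le \|x - A f_i\| \le \|x - A' f'_i\| + \sqrt{q}\,\delta_1 + M \delta_2$$

for each $i$ and each $x \in \mathbb{R}^p$, these functions ensure that
$\underline{g}_{(F, A)} \le g_{(F, A)} \le \overline{g}_{(F, A)}$. If we choose
$R > 0$ to be greater than $\sqrt{q}\,\delta_1 + M \delta_2 + M \sqrt{q}$,

$$\begin{aligned}
\int \bigl[ \overline{g}_{(F, A)}(x) - \underline{g}_{(F, A)}(x) \bigr] P(dx)
  &\le \sum_{i=1}^{k} \int \bigl[ \phi(\|x - A' f'_i\| + \sqrt{q}\,\delta_1 + M \delta_2)
     - \phi(\|x - A' f'_i\| - \sqrt{q}\,\delta_1 - M \delta_2) \bigr] P(dx) \\
  &\le k \sup_{\|x\| \le R} \; \sup_{f \in B_q(5M)} \; \sup_{A \in S_{p \times q}(\sqrt{q})}
     \bigl[ \phi(\|x - A f\| + \sqrt{q}\,\delta_1 + M \delta_2)
     - \phi(\|x - A f\| - \sqrt{q}\,\delta_1 - M \delta_2) \bigr] \\
  &\quad + 2 k \lambda \int_{\|x\| \ge R} \phi(\|x\|) \, P(dx).
\end{aligned}$$

The second term is less than $\epsilon / 2$ if $R$ is sufficiently large.
Moreover, because $\phi$ is uniformly continuous on a bounded set, the first
term can be made less than $\epsilon / 2$ if $\delta_1, \delta_2 > 0$ are
sufficiently small. Thus, the uniform SLLN is proven. □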
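The two-sided bound on $\|x - A f_i\|$ used in the proof follows from the triangle inequality. A short sketch of this step is given here, assuming (as in the construction of the finite class) that $\|f_i\| \le M$, $\|f_i - f'_i\| < \delta_1$, $\|A - A'\|_F < \delta_2$, and $\|A'\|_F = \sqrt{q}$, and using the standard bound $\|B v\| \le \|B\|_F \|v\|$:

```latex
\begin{aligned}
\bigl|\, \|x - A f_i\| - \|x - A' f_i'\| \,\bigr|
  &\le \|A f_i - A' f_i'\| \\
  &\le \|(A - A') f_i\| + \|A' (f_i - f_i')\| \\
  &\le \|A - A'\|_F \, \|f_i\| + \|A'\|_F \, \|f_i - f_i'\| \\
  &\le M \delta_2 + \sqrt{q}\, \delta_1 .
\end{aligned}
```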

Similarly, the continuity of $\Phi(\cdot, \cdot, P)$ on $\Theta_k^*(M)$ can be
proven.

Proposition 2. Let $M > 0$ be an arbitrary number. Suppose that
$\int \phi(\|x\|) \, P(dx) < \infty$. Then, $\Phi(\cdot, \cdot, P)$ is
continuous on $\Theta_k^*(M)$.

Proof. If $(F, A), (G, B) \in \Theta_k^*(M)$ are selected such that
$d_H(F, G) < \delta_1$ and $\|A - B\|_F < \delta_2$, then for each $g \in G$,
there exists $f(g) \in F$ with $\|g - f(g)\| < \delta_1$, and furthermore,

$$\begin{aligned}
\Phi(F, A, P) - \Phi(G, B, P)
  &= \int \Bigl[ \min_{f \in F} \phi(\|x - A f\|) - \min_{g \in G} \phi(\|x - B g\|) \Bigr] P(dx) \\
  &\le \int \max_{g \in G} \bigl[ \phi(\|x - A f(g)\|) - \phi(\|x - B g\|) \bigr] P(dx) \\
  &\le k \sup_{\|x\| \le R} \max_{g \in G} \bigl[ \phi(\|x - B g\| + M \delta_2 + \delta_1) - \phi(\|x - B g\|) \bigr] \\
  &\quad + 2 k \lambda \int_{\|x\| \ge R} \phi(\|x\|) \, P(dx) \qquad (7)
\end{aligned}$$

for $R > \delta_1 + M(1 + \delta_2)$. When a sufficiently large $R$ and
sufficiently small $\delta_1, \delta_2 > 0$ are selected, the last bound is
less than $\epsilon$. For each $f \in F$, there also exists $g(f) \in G$ with
$\|f - g(f)\| < \delta_1$. Therefore, the other inequality necessary for the
continuity is obtained by interchanging $(F, A)$ and $(G, B)$ in the
inequality (7). □

5. The consistency theorem 

5.1. The existence of the population global optimizers 

The aim of this paper is to prove that, for a fixed measure $P$ satisfying some
natural assumptions, the infimum distance between the (measurable) estimator
$\hat{\theta}_n$ with $\Phi(\hat{\theta}_n, P_n) = m_k(P_n)$ and the parameters
achieving $m_k(P)$ converges almost surely to 0 as the sample size goes to
infinity. However, there may be no such parameters. Thus, before providing the
consistency theorem, a sufficient condition for the existence of parameters
achieving $m_k(P)$ in $\Xi_k$ is provided. The following proposition ensures
the existence of such parameters. The proof and some details about the
proposition are given in Appendix A.

Proposition 3. Suppose that $\int \phi(\|x\|) \, P(dx) < \infty$ and that
$m_j(P) > m_k(P)$ for $j = 1, 2, \dots, k - 1$. Then, $\Theta' \ne \emptyset$.

From Lemma 4 in Appendix A, there exists $M > 0$ such that $F \subset B_q(5M)$
for all $(F, A) \in \Theta'$. Moreover, under the assumption of Proposition 3,
the following identification condition can be proven:

$$\inf_{\theta \in \Theta_k^*(5M) : \, d(\theta, \Theta') \ge \epsilon} \Phi(\theta, P) > \inf_{\theta \in \Theta'} \Phi(\theta, P) \quad \text{for all } \epsilon > 0.$$

The proof of the identification condition is also given in Appendix A. The
identification condition is used in Section 6.
5.2. Strong consistency of reduced k-means clusterings

If the parameter space is $\Theta_k^*(M)$, the strong consistency of RKM
clustering can be proven. Note that since $\Theta_k^*(M)$ is compact, we have
$\Theta^* \ne \emptyset$ and the identification condition:

$$\inf_{\theta \in \Theta_\epsilon^*(M)} \Phi(\theta, P) > \inf_{\theta \in \Theta^*} \Phi(\theta, P) \quad \text{for all } \epsilon > 0,$$

where $\Theta_\epsilon^*(M) := \{\theta \in \Theta_k^*(M) \mid d(\theta, \Theta^*) \ge \epsilon\}$.

Proposition 4. Suppose that $\int \phi(\|x\|) \, P(dx) < \infty$. Then, for each
$M > 0$,

$$\lim_{n \to \infty} d(\hat{\theta}_n^*, \Theta^*) = 0 \quad \text{a.s., and} \quad \lim_{n \to \infty} m_k^*(P_n \mid M) = m_k^*(P \mid M) \quad \text{a.s.}$$

Proof. Given the uniform SLLN and the continuity of $\Phi(\cdot, \cdot, P)$,
the proof of this proposition follows by an argument similar to the proof of
the following consistency theorem. □

In the study by Pollard (1981), the uniqueness of the parameter is also
assumed for the strong consistency theorem. As discussed in Section 3, we
cannot assume the uniqueness condition. Thus, the condition that
$m_j(P) > m_k(P)$ for $j = 1, 2, \dots, k - 1$ is assumed instead of the
uniqueness condition.

This condition is equivalent to the distinctness condition that $F(k)$ has $k$
distinct points for all $(F(k), A(k)) \in \Theta'(k)$. Indeed, suppose that
there exists $\theta = (F(k), A(k)) \in \Theta'(k)$ such that $F(k)$ has
$k - 1$ or fewer distinct points; that is, $\#(F(k)) < k$. Then there exists
$i \in \mathbb{N}$ such that $i < k$ and $\theta \in \Xi_i$. Then,
$m_i(P) = m_k(P)$, which contradicts $m_i(P) > m_k(P)$. Thus, the condition
that $m_j(P) > m_k(P)$ for $j = 1, 2, \dots, k - 1$ implies the distinctness
condition. Moreover, this condition is equivalent to $m_{k-1}(P) > m_k(P)$
since $m_k(P) \ge m_l(P)$ for each $k, l \in \mathbb{N}$ satisfying $k < l$.

The following main theorem gives a sufficient condition for the strong
consistency of the estimator of RKM clustering.

Theorem 1. Suppose that $\int \phi(\|x\|) \, P(dx) < \infty$ and that
$m_j(P) > m_k(P)$ for $j = 1, 2, \dots, k - 1$. Then,
$\Theta' \ne \emptyset$,

$$\lim_{n \to \infty} d(\hat{\theta}_n, \Theta') = 0 \quad \text{a.s., and} \quad \lim_{n \to \infty} m_k(P_n) = m_k(P) \quad \text{a.s.}$$
6. Proof of Theorem 1

Because almost sure convergence is dealt with, there exist null sets of
elements for which the convergence does not hold. Hereafter, $\Omega_1$ denotes
the set obtained by removing a proper null set from $\Omega$. In the first step
of the proof, it is shown that when $n$ is sufficiently large, the estimators
of the cluster centers are contained within a compact ball that does not
depend on $\omega \in \Omega$. For convenience, it is assumed that
$\phi(r) \to \infty$ as $r \to \infty$. When $\phi$ is bounded, the proof is a
little more complicated.

First, we prove the following lemma.

Lemma 1. Suppose that $\int \phi(\|x\|) \, P(dx) < \infty$. Then, there exists
$M > 0$ such that

$$P\left( \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} \{\omega \mid \forall (F_m, A_m) \in \Theta'_m; \; F_m(\omega) \cap B_q(M) \ne \emptyset\} \right) = 1.$$

Proof. Select an appropriate value $r > 0$ such that the ball $B_p(r)$ has
positive $P$ measure, i.e., $P(B_p(r)) > 0$. Let $M$ be sufficiently large to
satisfy $M > r$ and

$$\phi(M - r) P(B_p(r)) > \int \phi(\|x\|) \, P(dx). \qquad (8)$$

From the definition of $\hat{\theta}_n = (F_n, A_n)$,
$\Phi(F_n, A_n, P_n) \le \Phi(F_0, A, P_n)$ for any set $F_0$ containing at
most $k$ points and any $A \in \mathcal{O}(p \times q)$. The parameter $F_0$ is
chosen such that it consists only of the origin. Then, by the SLLN,

$$\Phi(F_0, A, P_n) = \int \phi(\|x\|) \, P_n(dx) \to \int \phi(\|x\|) \, P(dx) \quad \text{a.s.}$$

for each $A \in \mathcal{O}(p \times q)$.

Let $\Omega' := \{\omega \in \Omega_1 \mid \forall n \in \mathbb{N}; \; \exists m \ge n; \; \exists (F_m, A_m) \in \Theta'_m; \; F_m(\omega) \cap B_q(M) = \emptyset\}$.
By the axiom of choice, for an arbitrary $\omega \in \Omega'$ there exists a
subsequence $\{n_l\}_{l \in \mathbb{N}}$ such that $n_s < n_t$ ($s < t$) and
$F_{n_l} \cap B_q(M) = \emptyset$. Thus,

$$\begin{aligned}
\limsup_{l} \Phi(F_{n_l}, A_{n_l}, P_{n_l})
  &\ge \limsup_{l} \frac{1}{n_l} \sum_{i \in \{i \mid X_i \in B_p(r)\}} \min_{f \in F_{n_l}} \phi(\|X_i - A_{n_l} f\|) \\
  &\ge \limsup_{l} \frac{1}{n_l} \sum_{i \in \{i \mid X_i \in B_p(r)\}} \phi(M - r) \\
  &= \phi(M - r) \limsup_{l} P_{n_l}(B_p(r)) = \phi(M - r) P(B_p(r)).
\end{aligned}$$

On the other hand,
$\limsup_l m_k(P_{n_l}) \le \lim_l \Phi(F_0, A, P_{n_l})$ because
$m_k(P_{n_l}) \le \Phi(F_0, A, P_{n_l})$. Therefore, we have
$\limsup_l m_k(P_{n_l}) \le \int \phi(\|x\|) \, P(dx)$ and
$\limsup_l \Phi(F_{n_l}, A_{n_l}, P_{n_l}) > \int \phi(\|x\|) \, P(dx)$, which
is a contradiction. Therefore, $P(\Omega') = 0$, that is,

$$P\left( \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} \{\omega \mid \forall (F_m, A_m) \in \Theta'_m; \; F_m(\omega) \cap B_q(M) \ne \emptyset\} \right) = 1.$$

□

Without loss of generality, all $F_n$ can be assumed to contain at least one
point of $B_q(M)$ when $n$ is sufficiently large. The next lemma shows that,
for sufficiently large $n$, there exists $M > 0$ such that the closed ball
$B_q(5M)$ contains all estimators of centers. When $k = 1$, the next lemma is
obviously satisfied.

From the results in Section 4 and using the same arguments as in the final
part of this section, the conclusions of the theorem are proven when $k = 1$.

Lemma 2. Under the assumption of the theorem, there exists $M > 0$ such that

$$P\left( \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} \{\omega \mid \forall (F_m, A_m) \in \Theta'_m; \; F_m(\omega) \subset B_q(5M)\} \right) = 1.$$

Proof. Choose $M > 0$ sufficiently large to satisfy the inequality (8) and

$$\lambda \int_{\|x\| \ge 2M} \phi(\|x\|) \, P(dx) < \epsilon, \qquad (9)$$

where $\epsilon > 0$ is selected to ensure $\epsilon + m_k(P) < m_{k-1}(P)$.
Note that $m_j(P) \le m_j^*(P \mid M)$ for $j \in \mathbb{N}$.

Suppose that $F_n$ contains at least one center outside $B_q(5M)$ and consider
the effect on $\Phi(F_n, A, P_n)$ of deleting such outside centers from $F_n$
for all $A \in \mathcal{O}(p \times q)$. From Lemma 1, all $F_n$ contain at
least one center in $B_q(M)$ when $n$ is sufficiently large, say $f_1$. In the
worst case, the cluster of $f_1 \in B_q(M)$ should contain all sample points
belonging to clusters outside $B_q(5M)$. Because these points must be outside
$B(2M)$, the increment of $\Phi(F_n, A, P_n)$ due to the deletion of centers
outside $B_q(5M)$ from $F_n$ would be at most

$$\begin{aligned}
\int_{\|x\| \ge 2M} \phi(\|x - A f_1\|) \, P_n(dx)
  &\le \int_{\|x\| \ge 2M} \phi(\|x\| + \|f_1\|) \, P_n(dx) \\
  &\le \int_{\|x\| \ge 2M} \phi(2\|x\|) \, P_n(dx) \\
  &\le \lambda \int_{\|x\| \ge 2M} \phi(\|x\|) \, P_n(dx).
\end{aligned}$$

Denote the set obtained by deleting centers outside $B_q(5M)$ from $F_n$ by
$F_n^*$. For each $A \in \mathcal{O}(p \times q)$, $(F_n^*, A)$ is contained in
$\Theta_{k-1}^*(5M)$, and thus,

$$\Phi(F_n^*, A, P_n) \ge m_{k-1}^*(P_n \mid 5M) \ge m_{k-1}(P_n).$$

For each $x$ satisfying $\|x\| < 2M$ and each $A \in \mathcal{O}(p \times q)$,
we have

$$\|x - A f\| > 3M \quad \text{for all } f \notin B_q(5M)$$

and

$$\|x - A g\| < 3M \quad \text{for all } g \in B_q(M).$$

Thus,

$$\int_{\|x\| < 2M} \min_{f \in F_n} \phi(\|x - A f\|) \, P_n(dx) = \int_{\|x\| < 2M} \min_{f \in F_n^*} \phi(\|x - A f\|) \, P_n(dx)$$

for all $A \in \mathcal{O}(p \times q)$. Note that

$$\lim_{n \to \infty} m_{k-1}^*(P_n \mid 5M) = m_{k-1}^*(P \mid 5M) \quad \text{a.s.}$$

by Proposition 4.

Let $\Omega^* := \{\omega \in \Omega_1 \mid \forall n \in \mathbb{N}; \; \exists m \ge n; \; \exists (F_m, A_m) \in \Theta'_m; \; F_m(\omega) \not\subset B_q(5M)\}$.
By the axiom of choice, for an arbitrary $\omega \in \Omega^*$ there exists a
subsequence $\{n_l\}_{l \in \mathbb{N}}$ such that $n_s < n_t$ ($s < t$) and
$F_{n_l} \not\subset B_q(5M)$. For any $F$ with $k$ or fewer points and any
$A \in \mathcal{O}(p \times q)$,

$$\begin{aligned}
m_{k-1}^*(P \mid 5M)
  &\le \liminf_{l} \Phi(F_{n_l}^*, A_{n_l}, P_{n_l}) \le \limsup_{l} \Phi(F_{n_l}^*, A_{n_l}, P_{n_l}) \\
  &= \limsup_{l} \left[ \int_{\|x\| < 2M} \min_{f \in F_{n_l}^*} \phi(\|x - A_{n_l} f\|) \, P_{n_l}(dx)
     + \int_{\|x\| \ge 2M} \min_{f \in F_{n_l}^*} \phi(\|x - A_{n_l} f\|) \, P_{n_l}(dx) \right] \\
  &\le \limsup_{l} \left[ \Phi(F_{n_l}, A_{n_l}, P_{n_l}) + \lambda \int_{\|x\| \ge 2M} \phi(\|x\|) \, P_{n_l}(dx) \right] \\
  &\le \limsup_{n} \Phi(F, A, P_n) + \lambda \int_{\|x\| \ge 2M} \phi(\|x\|) \, P(dx). \qquad (10)
\end{aligned}$$

Set $(F, A) \in \Theta'$; that is, $m_k(P) = \Phi(F, A, P)$. From the
requirement on $M > 0$ in the inequality (9) and the SLLN, the last bound of
the inequality (10) is less than

$$\Phi(F, A, P) + \epsilon = m_k(P) + \epsilon < m_{k-1}(P).$$

This is a contradiction. Thus, the following is obtained:

$$P\left( \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} \{\omega \mid \forall (F_m, A_m) \in \Theta'_m; \; F_m(\omega) \subset B_q(5M)\} \right) = 1.$$

□



For sufficiently large $n$, all $F_n$ satisfying

$$\inf_{A \in \mathcal{O}(p \times q)} \Phi(F_n, A, P_n) = m_k(P_n)$$

lie in $\mathcal{R}_k^*(5M)$. From Proposition 3 and Lemma 4,
$\mathcal{R}_k^*(5M)$ contains all optimal sets satisfying

$$\inf_{A \in \mathcal{O}(p \times q)} \Phi(F, A, P) = m_k(P).$$

Pollard (1981) assumes, as a requirement on $M$, that $M$ is large enough for
$\mathcal{R}_k^*(5M)$ to contain the optimal cluster centers; it follows from
the above that this requirement is unnecessary.

In a way similar to Theorem 5.14 of van der Vaart (1998), if we obtain the
continuity of $\Phi(\cdot, \cdot, P)$ and the uniform SLLN, i.e.,

$$\sup_{(F, A) \in \Theta_k^*(5M)} |\Phi(F, A, P_n) - \Phi(F, A, P)| \to 0 \quad \text{a.s.},$$

the theorem is completely proven.
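Let

$$\tilde{\theta}_n := \begin{cases} \hat{\theta}_n & \text{if } \hat{\theta}_n \in \Theta_k^*(5M), \\ \theta^* & \text{if } \hat{\theta}_n \notin \Theta_k^*(5M), \end{cases}$$

where $\theta^* \in \Theta_k^*(5M)$ is chosen to ensure
$d(\theta^*, \Theta') > 0$. Then, for a sufficiently large $n$,
$\tilde{\theta}_n = \hat{\theta}_n$ by Lemma 2, and the following condition is
obtained:

$$\limsup_{n} \left[ \Phi(\tilde{\theta}_n, P_n) - \inf_{\theta \in \Theta'} \Phi(\theta, P_n) \right] \le 0 \quad \text{a.s.}$$

Since $\limsup_n \Phi(\theta_0, P_n) = \Phi(\theta_0, P)$ ($= m_k(P)$) for any
fixed $\theta_0 \in \Theta'$,

$$\limsup_{n} \inf_{\theta \in \Theta'} \Phi(\theta, P_n) \le \limsup_{n} \Phi(\theta_0, P_n) = m_k(P) \quad \text{a.s.}$$

Thus,

$$\begin{aligned}
0 &\ge \limsup_{n} \Phi(\tilde{\theta}_n, P_n) - \limsup_{n} \inf_{\theta \in \Theta'} \Phi(\theta, P_n) \\
  &\ge \limsup_{n} \Phi(\tilde{\theta}_n, P_n) - m_k(P) \quad \text{a.s.} \qquad (11)
\end{aligned}$$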



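Let $\Theta_\epsilon^*(5M) := \{\theta \in \Theta_k^*(5M) \mid d(\theta, \Theta') \ge \epsilon\}$
for each $\epsilon > 0$. From the uniform SLLN,

$$\liminf_{n} \inf_{\theta \in \Theta_\epsilon^*(5M)} \Phi(\theta, P_n) \ge \inf_{\theta \in \Theta_\epsilon^*(5M)} \Phi(\theta, P) \quad \text{a.s.} \qquad (12)$$

for all $\epsilon > 0$. An arbitrary $\epsilon > 0$ is selected. From
Corollary 1 and the inequalities (11) and (12), we have

$$\liminf_{n} \inf_{\theta \in \Theta_\epsilon^*(5M)} \Phi(\theta, P_n) > \limsup_{n} \Phi(\tilde{\theta}_n, P_n) \quad \text{a.s.} \qquad (13)$$

That is, for any $\omega \in \Omega$ satisfying the inequality (13), there
exists $n_0 \in \mathbb{N}$ such that

$$\inf_{\theta \in \Theta_\epsilon^*(5M)} \Phi(\theta, P_n) > \Phi(\hat{\theta}_n, P_n) = \Phi(\tilde{\theta}_n, P_n)$$

for all $n \ge n_0$. Suppose, to the contrary, that there exists $n \ge n_0$
such that $d(\hat{\theta}_n, \Theta') \ge \epsilon$. Then, we obtain

$$\inf_{\theta \in \Theta_\epsilon^*(5M)} \Phi(\theta, P_n) = \Phi(\hat{\theta}_n, P_n),$$

which is a contradiction. Thus, we obtain that
$d(\hat{\theta}_n, \Theta') < \epsilon$ for all $n \ge n_0$. That is,

$$\lim_{n \to \infty} d(\hat{\theta}_n, \Theta') = 0 \quad \text{a.s.}$$

is proven. From the continuity of $\Phi(\cdot, \cdot, P)$, the following is
obtained:

$$\lim_{n \to \infty} m_k(P_n) = m_k(P) \quad \text{a.s.}$$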

7. Selection of the number of dimensions 



In RKM clustering, the numbers of clusters and dimensions, $k$ and $q$, have to
be determined appropriately to obtain a good clustering result. For
determining the number of clusters, Wang (2010) proposed a new selection
criterion based on clustering stability. This criterion can be applied for
determining other tuning parameters of some clustering methods (e.g., Sun
et al., 2012).

In this section, we propose a new simple criterion for determining the
number of dimensions given the number of clusters, which is not based on
clustering stability. We also propose a consistent estimator of the criterion.
Moreover, we illustrate the effectiveness of the criterion through numerical
experiments.



7.1. New criterion for determining the number of dimensions 

First, we define a variance ratio criterion for a population distribution $P$
by

$$\mathrm{VR}(q \mid P) := \inf_{(F, A) \in \Theta'} \frac{\int \min_{f \in F} \|A^T x - f\|^2 \, P(dx)}{\int \|A^T x - A^T \mu\|^2 \, P(dx)},$$

where $\mu = \int x \, P(dx)$.

Here, we assume that the population global optimal coefficient matrices are
determined uniquely up to the rotational indeterminacy of $A$, that is, there
exists $(F_0, A_0) \in \Theta'$ such that for all $(F, A) \in \Theta'$ there
exists $R \in \mathcal{O}(q)$ such that $A = A_0 R$. Let
$(F, A), (F_*, A_*) \in \Theta'$ with $F \ne F_*$ or $A \ne A_*$. We have
$\Phi(F, A, P) = \Phi(F_*, A_*, P)$ and
$\int \|A^T x\|^2 \, P(dx) = \int \|A_*^T x\|^2 \, P(dx)$.

Since

$$\Phi(F, A, P) = \int \|x\|^2 \, P(dx) - \int \|A^T x\|^2 \, P(dx) + \int \min_{f \in F} \|A^T x - f\|^2 \, P(dx),$$

we obtain

$$\frac{\int \min_{f \in F} \|A^T x - f\|^2 \, P(dx)}{\int \|A^T x - A^T \mu\|^2 \, P(dx)} = \frac{\int \min_{f_* \in F_*} \|A_*^T x - f_*\|^2 \, P(dx)}{\int \|A_*^T x - A_*^T \mu\|^2 \, P(dx)}.$$

Unfortunately, we cannot obtain the value of this criterion since the
population distribution is unknown. However, we can construct a consistent
estimator of $\mathrm{VR}(q \mid P)$. We define an estimator of
$\mathrm{VR}(q \mid P)$ by

$$\mathrm{VR}(q \mid P_n) := \frac{\int \min_{f \in F_n} \|A_n^T x - f\|^2 \, P_n(dx)}{\int \|A_n^T x - A_n^T \hat{\mu}_n\|^2 \, P_n(dx)},$$

where $\hat{\theta}_n = (F_n, A_n)$ and $\hat{\mu}_n = \int x \, P_n(dx)$ is
the sample mean. The following theorem gives sufficient conditions for the
strong consistency of the estimator $\mathrm{VR}(q \mid P_n)$.

Theorem 2. Suppose that $\int \phi(\|x\|) \, P(dx) < \infty$ and
$m_1(P) > m_2(P) > \dots > m_k(P)$. Then,

$$\int \|A^T x - A^T \mu\|^2 \, P(dx) > 0 \quad \text{for all } (F, A) \in \Theta'$$

and

$$\lim_{n \to \infty} \mathrm{VR}(q \mid P_n) = \mathrm{VR}(q \mid P) \quad \text{a.s.}$$

Proof. Without loss of generality, we assume $\mu = 0$. First, we prove

$$\int \|A^T (x - \mu)\|^2 \, P(dx) > 0 \quad \text{for all } (F, A) \in \Theta'.$$

Suppose, to the contrary, that there exists $(F, A) \in \Theta'(k)$ such that
$\int \|A^T x\|^2 \, P(dx) = 0$. Then, $\|A^T x\|^2 = 0$ for all $x$ in the
support of $P$. Since

$$\Phi(F, A, P) = \int \|x - A A^T x\|^2 \, P(dx) + \int \min_{f \in F} \|A^T x - f\|^2 \, P(dx),$$

$F$ must contain zero. Let $F_0 := \{0\} \in \mathcal{R}_1$; then
$m_k(P) = \Phi(F_0, A, P) \ge m_1(P)$. This is a contradiction.

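Next, we prove the consistency of $\mathrm{VR}(q \mid P_n)$. From Theorem 1, we
have

$$\lim_{n \to \infty} d(\hat{\theta}_n, \Theta') = 0 \quad \text{a.s.}$$

In a similar way to the proof of the uniform SLLN (6), we obtain

$$\lim_{n \to \infty} \sup_{A \in \mathcal{O}(p \times q)} \left| \int \|A^T x\|^2 \, P_n(dx) - \int \|A^T x\|^2 \, P(dx) \right| = 0 \quad \text{a.s.} \qquad (14)$$

and

$$\lim_{n \to \infty} \sup_{(F, A) \in \Theta_k^*(5M)} \left| \int \min_{f \in F} \|A^T x - f\|^2 \, P_n(dx) - \int \min_{f \in F} \|A^T x - f\|^2 \, P(dx) \right| = 0 \quad \text{a.s.} \qquad (15)$$

Let $\hat{\theta}_n = (F_n, A_n)$ and $(F, A) \in \Theta'$. We have

$$\lim_{n \to \infty} \int \|A_n^T x\|^2 \, P_n(dx) = \int \|A^T x\|^2 \, P(dx) \quad \text{a.s.}$$

and

$$\lim_{n \to \infty} \int \min_{f \in F_n} \|A_n^T x - f\|^2 \, P_n(dx) = \int \min_{f \in F} \|A^T x - f\|^2 \, P(dx) \quad \text{a.s.}$$

Therefore, we obtain

$$\lim_{n \to \infty} \mathrm{VR}(q \mid P_n) = \mathrm{VR}(q \mid P) \quad \text{a.s.}$$

□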



If the number of dimensions is chosen larger than the optimal one, the
subspace of RKM may be influenced by noise variables that do not have a
cluster structure. Let $q_*$ be the optimal number of dimensions. Define
$\mathrm{VR}(0 \mid P) := 0$ and
$\mathrm{VR}(q \mid P) := \mathrm{VR}(q - 1 \mid P)$ for
$q = \min\{k - 1, p\}$. The forward difference at $q_*$,
$\Delta_+(q_*) := \mathrm{VR}(q_* + 1 \mid P) - \mathrm{VR}(q_* \mid P)$, may
be considerably larger than the backward difference at $q_*$,
$\Delta_-(q_*) := \mathrm{VR}(q_* \mid P) - \mathrm{VR}(q_* - 1 \mid P)$. That
is, for the optimal number of dimensions $q_*$, the second-order central
difference at $q_*$,
$\Delta^2(q_*) := \Delta_+(q_*) - \Delta_-(q_*) = \mathrm{VR}(q_* + 1 \mid P) - 2\mathrm{VR}(q_* \mid P) + \mathrm{VR}(q_* - 1 \mid P)$,
may be larger than the second-order central difference at $q$
($q \ne q_*$). For example, we may estimate the optimal number of dimensions
by

$$\hat{q} := \arg\max_{q} \Delta^2(q),$$

where $\Delta^2(q) := \mathrm{VR}(q + 1 \mid P) - 2\mathrm{VR}(q \mid P) + \mathrm{VR}(q - 1 \mid P)$.
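The selection rule can be sketched as follows. The sketch uses the standard second-order central difference, i.e. the forward difference minus the backward difference, $\mathrm{VR}(q+1) - 2\mathrm{VR}(q) + \mathrm{VR}(q-1)$, with the boundary conventions stated above. The function names `variance_ratio` and `select_dimension` are hypothetical, the sample version of the criterion is computed for a given fitted pair (centers `F`, loadings `A`), and the numbers in the example are made up for illustration rather than taken from the experiments.

```python
import numpy as np

def variance_ratio(X, A, F):
    """Sample VR: within-cluster variation over total variation in the subspace."""
    Y = X @ A                                           # rows are A^T x_i
    d2 = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
    within = d2.min(axis=1).mean()                      # mean of min_f ||A^T x - f||^2
    total = ((Y - Y.mean(axis=0)) ** 2).sum(axis=1).mean()
    return float(within / total)

def select_dimension(vr):
    """vr[j] holds VR(q) for q = j + 1; returns (q_hat, second differences)."""
    vr = np.asarray(vr, dtype=float)
    # Boundary conventions: VR(0) = 0, and VR at the upper limit repeats the last value.
    padded = np.concatenate(([0.0], vr, [vr[-1]]))
    second_diff = padded[2:] - 2.0 * padded[1:-1] + padded[:-2]
    return int(np.argmax(second_diff)) + 1, second_diff

# Made-up VR values for q = 1, ..., 4 with a jump after q = 2.
q_hat, second_diff = select_dimension([0.05, 0.06, 0.40, 0.45])
```

With these illustrative values the largest second difference occurs at $q = 2$, the last dimension before the criterion jumps.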
7.2. Numerical experiments

In this subsection, we examine the effectiveness of the criterion through
numerical experiments. Let $K$ be the number of clusters, $q$ be the number of
dimensions of the low-dimensional space, $p_1$ be the number of informative
variables, $p_2$ be the number of correlated noise variables, and $p_3$ be the
number of independent noise variables. Denote by $O_{p \times q}$ the
$p \times q$ zero matrix. A $p_1 \times q$ column-wise orthonormal matrix is
generated randomly, say $A_*$. $K$ cluster centers in the low-dimensional
space are independently generated from the $q$-dimensional uniform
distribution on $[-15, 15]^q$, say $f_k$ ($k = 1, \dots, K$). Cluster
indicators are independently generated from the multinomial distribution for
$K$ trials with equal probabilities, say $u_i = (u_{i1}, \dots, u_{iK})$
($i = 1, \dots, n$). Set $A = [A_*^T, O_{q \times (p_2 + p_3)}]^T$,
$\Sigma_{p_2} = (\sigma_{ij})_{p_2 \times p_2}$ with $\sigma_{ii} = 1$ and
$\sigma_{ij} = 0.25$ ($i \ne j$), and

$$\Sigma_p = \begin{pmatrix}
I_{p_1} & O_{p_1 \times p_2} & O_{p_1 \times p_3} \\
O_{p_2 \times p_1} & \Sigma_{p_2} & O_{p_2 \times p_3} \\
O_{p_3 \times p_1} & O_{p_3 \times p_2} & I_{p_3}
\end{pmatrix}.$$

The simulated data of n observations, Xi e MP (i = 1, . . . , n), are gener- 
ated as 

K 

fc=i 

where are generated from the p-dimensional normal distribution N(0, E p ). 
Let X = [xi, . . . , a; n ] T and Z be the normalized data matrix with zero means 
and unit variances. 
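The data-generating scheme can be sketched with NumPy as follows. Several details are assumptions made only for this illustration: the random columnwise orthonormal matrix is drawn via a QR decomposition of a Gaussian matrix, the covariance block of the informative variables is taken to be the identity, standardisation uses the population standard deviation, and the function name `simulate_rkm_data` is hypothetical.

```python
import numpy as np


def simulate_rkm_data(n=400, K=8, q=2, p1=5, p2=5, p3=5, seed=0):
    """Simulate one data set with a cluster structure hidden in q dimensions."""
    rng = np.random.default_rng(seed)
    p = p1 + p2 + p3

    # Random p1 x q matrix with orthonormal columns (assumed: QR of a Gaussian matrix).
    a_star, _ = np.linalg.qr(rng.standard_normal((p1, q)))
    # Loading matrix A: informative rows on top, zero rows for the noise variables.
    A = np.vstack([a_star, np.zeros((p2 + p3, q))])

    # Cluster centers f_k, uniform on [-15, 15]^q.
    centers = rng.uniform(-15.0, 15.0, size=(K, q))
    # Cluster memberships with equal probabilities, as an n x K indicator matrix U.
    labels = rng.integers(0, K, size=n)
    U = np.eye(K)[labels]

    # Noise covariance: correlated block (1 on the diagonal, 0.25 elsewhere).
    sigma_p2 = np.full((p2, p2), 0.25)
    np.fill_diagonal(sigma_p2, 1.0)
    # Assumption: identity covariance for the informative block as well.
    sigma = np.eye(p)
    sigma[p1:p1 + p2, p1:p1 + p2] = sigma_p2

    noise = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    X = U @ centers @ A.T + noise

    # Z: columns standardised to zero mean and unit (population) variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, Z, A, centers, labels


X, Z, A, centers, labels = simulate_rkm_data()
print(X.shape, Z.shape, A.shape)
```

Projecting with `X @ A` gives the hidden low-dimensional configuration, which is what a plot of $XA$ would display.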

Here, we set $K = 8$, $n = 400$, $q = 2$ or $3$, and $p_1 = p_2 = p_3 = 5$ or $10$.
We generate 1000 data sets for each setting. Figure 3 shows the hidden
cluster structure $XA$ of one of the data sets with the setting $n = 400$,
$q = 2$, and $p_1 = p_2 = p_3 = 5$. Figure 4 shows the first two principal
components of PCA for $Z$ for the same data set as in Figure 3; the first two
principal components do not reflect the cluster structure. Moreover, Figure 5
shows the subspace of RKM with $q = 2$ for $Z$ for the same data set.
Figure 6 shows the adjusted Rand index (ARI), proposed by Hubert and Arabie
(1985), of RKM clustering for each number of dimensions of the subspace. In
Figure 6, we can see that the number of dimensions of the subspace strongly
affects the clustering result. Figures 7 and 8
Fig 3. Hidden cluster structure $XA$ of one of the data sets with the setting $n = 400$, $q = 2$,
and $p_1 = p_2 = p_3 = 5$.




Fig 4. First two dimensions of the principal scores of PCA for $Z$, which is the same data
set as in Figure 3 (ARI of the tandem clustering with the first two principal scores is 0.26).
Fig 5. The subspace of RKM for $Z$, which is the same data set as in Figure 3 (ARI of the
RKM clustering with $q = 2$ is 0.99).




Fig 6. ARI scores of RKM and tandem clustering with $q = 1, 2, \dots, 7$ for $Z$, which is
the same data set as in Figure 3. The solid line corresponds to the ARI scores of RKM clustering
and the dashed line to the ARI scores of tandem clustering.
Fig 7. $VR(q)$ scores of RKM with $q = 1, 2, \dots, 7$ for $Z$, which is the same data set as in
Figure 3 (horizontal axis: the number of dimensions).



Fig 8. $\Delta^2(q)$ scores of RKM with $q = 1, 2, \dots, 7$ for $Z$, which is the same data set as in
Figure 3 (horizontal axis: the number of dimensions).
Table 1

Agreement rates with each setting for 1000 data sets.

q    p_1 = p_2 = p_3    agreement rate

2    5                  0.84 (837/1000)
2    10                 0.95 (947/1000)
3    5                  0.73 (726/1000)
3    10                 0.89 (890/1000)



show that $VR(q)$ and $\Delta^2(q)$ are useful for estimating the optimal number of
dimensions.

Indeed, Table 1 shows, for each setting, the agreement rate over 1000 data sets
between the choice $\hat{q}$ and the optimal number $q^* := \arg\max_q \mathrm{ARI}(q)$.
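For concreteness, the two quantities behind Table 1 can be sketched in Python. The ARI below is the standard pair-counting form of the index of Hubert and Arabie (1985); the function names and the toy inputs are made up for illustration and are not the code used in the experiments.

```python
import numpy as np


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same objects."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    # Contingency table of the two partitions.
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)

    def pairs(x):
        # Number of unordered pairs that can be formed from x objects.
        return x * (x - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case: both partitions are trivial and identical.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def agreement_rate(q_hat, q_star):
    """Fraction of data sets on which the estimated and optimal dimensions coincide."""
    return float(np.mean(np.asarray(q_hat) == np.asarray(q_star)))


# Toy inputs (hypothetical): relabelled clusters and four simulated data sets.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
rate = agreement_rate([2, 2, 3, 2], [2, 3, 3, 2])
print(ari_same, ari_cross, rate)
```

Here $q^*$ for a data set would be obtained by computing the ARI of the RKM partition against the true cluster labels for each candidate $q$ and taking the maximiser.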

8. Conclusion 

This paper proves the strong consistency of RKM clustering under i.i.d.
sampling on the basis of the proof for conventional $k$-means clustering
provided by Pollard (1981). Since our proof is based on the usual
Blum-DeHardt uniform SLLN, which requires only stationarity and ergodicity (e.g.,
Peskir, 2000), we can obtain the same results for a stationary ergodic process.

Under the i.i.d. condition, we can derive the rate of convergence of the
empirically optimal clustering scheme if the support of the population
distribution is bounded; that is, $P(\|X_1\|^2 < B) = 1$ for some $B > 0$.
From Theorem 1 in Linder et al. (1994), for all $\epsilon > 0$ and
$n(\epsilon/8B)^2 > 2$, we can obtain

$$P\big[\,|m_k(P_n) - m_k(P)| > \epsilon\,\big] \le 2P\Big[\sup_{\theta} |\Phi(\theta, P_n) - \Phi(\theta, P)| > \epsilon\Big] \le 2P\Big[\sup_{F \in \mathcal{R}_k(p)} |KM(F, P_n) - KM(F, P)| > \epsilon\Big],$$

where $\phi(r) = r^2$, $\mathcal{R}_k(p) := \{R \subset \mathbb{R}^p \mid \#(R) \le k\}$, and
$KM(F, P) := \int \min_{f \in F} \|x - f\|^2 P(dx)$.

Considering the relationship between conventional $k$-means clustering
and RKM clustering, the results presented in this paper are applicable to
conventional $k$-means clustering. Methods related to RKM clustering
include factorial $k$-means (FKM) clustering proposed by Vichi & Kiers
(2001). In Terada (2013), the strong consistency of FKM clustering under
i.i.d. sampling (or for a stationary ergodic process) has been proven. The
form of the sufficient conditions for the strong consistency of FKM clustering is
similar to that for RKM clustering. Moreover, a new simple criterion for
determining the number of dimensions for a given number of clusters, together
with a consistent estimator of the criterion, has been proposed. Through numerical
experiments, the effectiveness of the criterion has been illustrated.

Future studies will examine the rate of convergence of the estimators
of RKM clustering and will propose a criterion for determining
the number of clusters.
Appendix A: The existence of $\Theta'$

The existence of the minimum points of $\Phi(\cdot, P)$ is proven.

Lemma 3. Suppose that $\int \phi(\|x\|) P(dx) < \infty$. There exists $M > 0$ such
that, for all $F' \in \mathcal{R}_k$ satisfying $F' \cap B_q(M) = \emptyset$,

$$\inf_{A \in \mathcal{O}(p \times q)} \Phi(F', A, P) > \inf_{\theta \in \Theta^*(M)} \Phi(\theta, P).$$

Proof. Arguing by contradiction, suppose that for any $M > 0$ there exists
$F' \in \mathcal{R}_k$ such that $F' \cap B_q(M) = \emptyset$ and

$$\inf_{A \in \mathcal{O}(p \times q)} \Phi(F', A, P) \le \inf_{\theta \in \Theta^*(M)} \Phi(\theta, P). \qquad (16)$$

Select an $r > 0$ such that the ball $B_p(r)$ has positive $P$-measure, i.e.,
$P(B_p(r)) > 0$. A sufficiently large $M$ is selected such that $M > r$ and inequality
(8) is satisfied. From inequality (16),

$$\int \phi(\|x\|) P(dx) \ge \inf_{\theta \in \Theta^*(M)} \Phi(\theta, P) \ge \inf_{A \in \mathcal{O}(p \times q)} \Phi(F', A, P) \ge \phi(M - r) P(B_p(r)).$$

This is a contradiction. □

Lemma 4. Suppose that $\int \phi(\|x\|) P(dx) < \infty$ and $m_j(P) > m_k(P)$ for
$j = 1, 2, \dots, k - 1$. There exists $M > 0$ such that, for any $F' \in \mathcal{R}_k$
satisfying $F' \not\subset B_q(5M)$,

$$\inf_{A \in \mathcal{O}(p \times q)} \Phi(F', A, P) > \inf_{\theta \in \Theta^*(5M)} \Phi(\theta, P).$$

Proof. Select a sufficiently large value $M > 0$ to satisfy inequalities (8)
and (9). To obtain a contradiction, suppose that for all $M > 0$ there exists
$F' \in \mathcal{R}_k$ satisfying $F' \not\subset B_q(5M)$ and

$$\inf_{A \in \mathcal{O}(p \times q)} \Phi(F', A, P) \le \inf_{\theta \in \Theta^*(5M)} \Phi(\theta, P).$$

Let $\mathcal{R}'_k$ be the set of such $F'$, so that

$$m_k(P) = \inf_{\theta \in \mathcal{R}'_k \times \mathcal{O}(p \times q)} \Phi(\theta, P).$$

From Lemma 3, each $F' \in \mathcal{R}'_k$ includes at least one element in $B_q(M)$, say
$f_1$.

For any $x$ satisfying $\|x\| < 2M$ and any $A \in \mathcal{O}(p \times q)$,

$$\|x - Af\| > 3M \quad \text{for all } f \notin B_q(5M)$$

and

$$\|x - Ag\| < 3M \quad \text{for all } g \in B_q(M).$$
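These two bounds follow from the triangle inequality once one notes that $\|Af\| = \|f\|$ for every $A \in \mathcal{O}(p \times q)$. The short derivation below is added for the reader's convenience; it assumes that $B_q(M)$ denotes the closed ball of radius $M$ centered at the origin of $\mathbb{R}^q$ and that the matrices in $\mathcal{O}(p \times q)$ have orthonormal columns, i.e., $A^T A = I_q$.

```latex
\begin{align*}
\|x - Af\| &\ge \|Af\| - \|x\| = \|f\| - \|x\| > 5M - 2M = 3M
  && \text{for } f \notin B_q(5M), \\
\|x - Ag\| &\le \|x\| + \|Ag\| = \|x\| + \|g\| < 2M + M = 3M
  && \text{for } g \in B_q(M).
\end{align*}
```

Consequently, for $\|x\| < 2M$, no point of $F'$ lying outside $B_q(5M)$ can be closer to $x$ than $f_1$, so such points do not affect the minimum over $F'$.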

Let $F^*$ denote the set obtained by deleting all elements outside $B_q(5M)$ from
$F'$. Then,

$$\int_{\|x\| < 2M} \min_{f \in F'} \phi(\|x - Af\|) P(dx) = \int_{\|x\| < 2M} \min_{f \in F^*} \phi(\|x - Af\|) P(dx).$$

Since $\int_{\|x\| \ge 2M} \phi(\|x - Af_1\|) P(dx) \le \lambda \int_{\|x\| \ge 2M} \phi(\|x\|) P(dx)$, we obtain

$$\begin{aligned}
\Phi(F', A, P) + \lambda \int_{\|x\| \ge 2M} \phi(\|x\|) P(dx)
&\ge \int_{\|x\| < 2M} \min_{f \in F^*} \phi(\|x - Af\|) P(dx) + \int_{\|x\| \ge 2M} \phi(\|x - Af_1\|) P(dx) \\
&\ge \Phi(F^*, A, P) \ge m_{k-1}(P)
\end{aligned}$$

for all $A \in \mathcal{O}(p \times q)$. Therefore, we obtain

$$m_k(P) + \epsilon \ge m_{k-1}(P).$$

This contradicts $m_k(P) + \epsilon < m_{k-1}(P)$. □

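We will denote the essential parameter space by $\Theta_k$; that is, $\Theta_k := \Theta^*(5M)$.
By Lemma 4,

$$\inf_{\theta \in \Xi_k} \Phi(\theta, P) = \inf_{\theta \in \Theta_k} \Phi(\theta, P),$$

and there is no $\theta \in (\mathcal{R}_k \setminus \mathcal{R}^*_k(5M)) \times \mathcal{O}(p \times q)$ satisfying $m_k(P) = \Phi(\theta, P)$.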

Proof of Proposition 3. First, it is proven that there exists a sequence $\{\theta_n\}_{n \in \mathbb{N}}$
in $\Theta_k$ such that $\Phi(\theta_n, P) \to m_k(P)$ as $n \to \infty$. Let $C := \{\Phi(\theta, P) \mid \theta \in \Theta_k\}$
and $m_k(P) = \inf C$. For all $x > m_k(P)$, there exists $c < x$ in $C$. Write
$x_n := m_k(P) + 1/n$ and $C_n := \{c \in C \mid c < x_n\}$. Let $\mathfrak{P}(C)$ be the power set of
$C$. From the axiom of choice, there exists a function $f : \mathfrak{P}(C) \setminus \{\emptyset\} \to C$ such
that $f(S) \in S$ for all $S \in \mathfrak{P}(C) \setminus \{\emptyset\}$. Let $c_n := f(C_n)$; then $x_n > c_n \ge m_k(P)$.
Thus, $c_n \to m_k(P)$ as $n \to \infty$. Using the axiom of choice, a sequence $\{\theta_n\}_{n \in \mathbb{N}}$
can be selected such that $\Phi(\theta_n, P) \to m_k(P)$ as $n \to \infty$.

From the compactness of $\Theta_k$, there exists a convergent subsequence of
$\{\theta_n\}_{n \in \mathbb{N}}$, say $\{\theta_{m_i}\}_{i \in \mathbb{N}}$. Let $\theta^* \in \Theta_k$ denote the limit of this subsequence, that
is, $\theta_{m_i} \to \theta^*$ as $i \to \infty$. Because $\Phi(\cdot, P)$ is continuous on $\Theta_k$, $\Phi(\theta^*, P) =
m_k(P)$. That is, $\Theta' \neq \emptyset$. □

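The next corollary ensures the identification condition for $\Phi(\cdot, P)$.

Corollary 1. Let $\Theta' := \{\theta \in \Theta_k \mid \Phi(\theta, P) = m_k(P)\}$. Assume the
assumptions of Lemma 4. Then,

$$\inf_{\theta \in \Xi_k :\, d(\theta, \Theta') > \epsilon} \Phi(\theta, P) > \inf_{\theta \in \Theta'} \Phi(\theta, P) \quad \text{for all } \epsilon > 0.$$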






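Proof. Let $\Theta_\epsilon := \{\theta \in \Theta_k \mid d(\theta, \Theta') > \epsilon\}$. To obtain a contradiction, suppose
that there exists $\epsilon > 0$ such that $\inf_{\theta \in \Theta_\epsilon} \Phi(\theta, P) = \inf_{\theta \in \Theta'} \Phi(\theta, P)$. As in
the proof of Proposition 3, there exists a sequence $\{\theta_n\}_{n \in \mathbb{N}}$ in $\Theta_\epsilon$ satisfying
$\Phi(\theta_n, P) \to m_k(P)$ as $n \to \infty$. From the compactness of $\Theta_k$, there exists
a convergent subsequence of $\{\theta_n\}_{n \in \mathbb{N}}$, say $\{\theta_{m_i}\}_{i \in \mathbb{N}}$. Let $\theta^* \in \Theta_k$ denote the
limit of this subsequence; then $\Phi(\theta^*, P) = m_k(P)$, that is, $\theta^* \in \Theta'$. On the
other hand, $d(\theta_{m_i}, \theta^*) < \epsilon$ for sufficiently large $i \in \mathbb{N}$ because $\theta_{m_i} \to \theta^*$ as $i \to \infty$.
Thus, $\theta_{m_i} \notin \Theta_\epsilon$ for sufficiently large $i \in \mathbb{N}$. This is a contradiction. □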



